Introduction

The New York City subway, one of the main forms of public transportation for New Yorkers, is highly convenient for local residents. At the same time, it exposes passengers to potential danger: busier subway stations attract criminals for certain kinds of crime, such as pickpocketing, grand larceny, and assault. Crowded, enclosed spaces like these create opportunities for crime.

Wordcloud using victims description

On November 21, around 12:00 AM, at 34th Street-Penn Station in Manhattan, Alkeem Loney, a 32-year-old man, was stabbed in the neck in an unprovoked attack and was later pronounced dead, according to the NYPD. The deadly incident is the latest in a spate of underground violence that comes as the MTA tries to get commuters back on mass transit. The crime raised widespread public concern about safety at subway stations, which affects almost everyone who lives, works, or studies in New York City.

As students living in New York City, most of us take the subway to campus in the early morning and back to our apartments at night on weekdays, and ride it to meet friends on weekends. Some of our friends have experienced attempted crimes. Staying out of danger at subway stations therefore matters to us personally. We hope to help people find comparatively safe and reliable routes when taking the subway.

Data

Data Introduction

Subway Crime

The original subway crime data has two parts. The first contains all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD). The second includes similar crimes. We join these two data frames and analyze only the crimes that happened in the NYC subway.

The variables we use are listed below (the meanings of the variables we do not use can be found in the link above):

| column name | description | type |
| --- | --- | --- |
| CMPLNT_NUM | Randomly generated persistent ID for each complaint | Number |
| CMPLNT_FR_DT | Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists) | Date & Time |
| CMPLNT_FR_TM | Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists) | Plain Text |
| CMPLNT_TO_DT | Ending date of occurrence for the reported event, if exact time of occurrence is unknown | Date & Time |
| CMPLNT_TO_TM | Ending time of occurrence for the reported event, if exact time of occurrence is unknown | Plain Text |
| OFNS_DESC | Description of offense corresponding with key code | Plain Text |
| LAW_CAT_CD | Level of offense: felony, misdemeanor, violation | Plain Text |
| SUSP_AGE_GROUP | Suspect’s Age Group | Plain Text |
| SUSP_RACE | Suspect’s Race Description | Plain Text |
| SUSP_SEX | Suspect’s Sex Description | Plain Text |
| Latitude | Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) | Number |
| Longitude | Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326) | Number |
| STATION_NAME | Transit station name | Plain Text |
| VIC_AGE_GROUP | Victim’s Age Group | Plain Text |
| VIC_RACE | Victim’s Race Description | Plain Text |
| VIC_SEX | Victim’s Sex Description | Plain Text |

Subway Passenger

The original subway passenger data comes from the MTA (Metropolitan Transportation Authority). It contains the cumulative entries and exits at each station every 4 hours from 2010 to now. The data is not in a readily usable format: it is split by time period across separate HTML pages. We read and process the passenger data with GenerateSubwayPassengerData.rmd.

The variables we use are:

| column name | description | type |
| --- | --- | --- |
| STATION | station name | Character |
| LINENAME | lines serving this station; one station can have more than one line | Character |
| DATE | format MM/DD/YYYY | Date |
| TIME | format HH:MM:SS | Date |
| ENTRIES | cumulative entries | Integer |
| EXITS | cumulative exits | Integer |

Data Cleaning

Subway Crime

The Least Distance

To compare the crime data with the subway passenger data, both must use the same subway line and station names (the two sources abbreviate station names differently).
We use each crime record's latitude and longitude to match it against the subway station information, which contains every station's name, line, and location. Each crime record is assigned to the station closest to it.
Crime records with anomalous longitude or latitude values are excluded.
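
The nearest-station matching described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names, the `max_km` cutoff used to drop records with anomalous coordinates, and the sample station coordinates are all assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS 84 points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(lat, lon, stations, max_km=1.0):
    """Return the closest station, or None if it is farther than max_km.

    max_km is an assumed cutoff for records with anomalous coordinates.
    """
    best = min(stations, key=lambda s: haversine_km(lat, lon, s["lat"], s["long"]))
    if haversine_km(lat, lon, best["lat"], best["long"]) > max_km:
        return None
    return best


# Hypothetical station table with approximate coordinates (illustration only)
stations = [
    {"station": "34 ST-PENN STA", "linename": "ACE", "lat": 40.752, "long": -73.993},
    {"station": "14 ST-UNION SQ", "linename": "LNQRW456", "lat": 40.735, "long": -73.990},
]

match = nearest_station(40.7515, -73.9925, stations)  # a record near Penn Station
far = nearest_station(41.5, -73.99, stations)         # a record far from any station
```

Here `match` is the Penn Station entry, while `far` is `None` because the closest station is well beyond the cutoff.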

Subway Passenger

K-Means

We set the number of clusters to 8 and use K-means to cluster the stations' latitudes and longitudes. After K-means we have 8 location clusters instead of the original 4 boroughs, which is closer to reality (for instance, the clusters separate lower, middle, and upper Manhattan) and better for model classification. The K-means code is in PassengerEDA.Rmd.

Imputation

Some entry and exit counts in the passenger data are missing; we impute them with the mean of the preceding values. The imputation code is in FutherCleanPassenger.py.
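
As a rough sketch of this kind of imputation (not the code in FutherCleanPassenger.py), the function below fills each missing count with the mean of the values before it. The function name and the window of two preceding values are assumptions for illustration.

```python
def impute_with_previous_mean(values, window=2):
    """Fill None entries with the mean of up to `window` preceding values.

    Previously imputed values count as preceding values. A gap at the very
    start of the series has nothing to average and stays None.
    """
    filled = []
    for v in values:
        if v is None:
            prev = [x for x in filled[-window:] if x is not None]
            v = sum(prev) / len(prev) if prev else None
        filled.append(v)
    return filled


# Hypothetical entry counts for one station, with gaps marked as None
entries = [None, 120, 100, None, 90, None]
imputed = impute_with_previous_mean(entries)
```

The first gap has no earlier values and is left as `None`; the later gaps are filled from the two values immediately before them.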

Google Maps API to find station coordinates

We want to get the coordinates of each station for the following reasons:

  • Location-based data visualization and analysis
  • More location-based features for the model
  • The station names in the crime and passenger data do not match, so we can use coordinates to match them

However, getting the correct coordinates is tricky. There are open datasets with NYC subway station information, but all of them use naming systems different from ours. In addition, the station names contain many duplicates. For instance, there are two 86 St stations in middle Manhattan and another one in Brooklyn. We can identify the correct coordinates of a station only by using both its station name and its line name. Our solution is therefore to use the Google Maps API. The code we use is Subway_info.py.
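
To illustrate why the station name alone is not a usable key, the short sketch below finds names that occur with more than one line group and shows that the (station, line) pair stays unique. The rows are illustrative values, not taken from the project data, and the function name is hypothetical.

```python
from collections import defaultdict


def ambiguous_station_names(rows):
    """Map each station name seen with more than one line group
    to the sorted list of those line groups."""
    lines_by_name = defaultdict(set)
    for station, linename in rows:
        lines_by_name[station].add(linename)
    return {
        name: sorted(lines)
        for name, lines in lines_by_name.items()
        if len(lines) > 1
    }


# Illustrative (station, linename) pairs, not the project's data
rows = [("86 ST", "456"), ("86 ST", "BC"), ("86 ST", "R"), ("CANAL ST", "1")]
ambiguous = ambiguous_station_names(rows)

# The pair (station, linename) is unique even when the name alone is not
unique_keys = {(station, linename) for station, linename in rows}
```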

Add service column

There are too many subway lines, and some of them share most of their tracks, so it is not reasonable to conduct analysis or build models on the line name. We therefore created a new variable called service, based on the MTA's definition. For instance, lines A, C and E are called ‘8 Avenue’.
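
A minimal sketch of such a line-to-service lookup is given below. The service labels are the ones that appear in the plotting code later in this report; the function name, and the choice to return every service for a multi-line station, are assumptions, since the report does not say how a station with several services is handled.

```python
# Service labels as they appear in the plotting code of this report
LINES_BY_SERVICE = {
    "8 Avenue(ACE)": "ACE",
    "6 Avenue(BDFM)": "BDFM",
    "Brooklyn-Queens Crosstown(G)": "G",
    "14 St-Canarsie(L)": "L",
    "Broadway(NQRW)": "NQRW",
    "7 Avenue(123)": "123",
    "Lexington Av(456)": "456",
    "Flushing(7)": "7",
    "Shuttle(S)": "S",
}

# Invert to a per-line lookup, e.g. "A" -> "8 Avenue(ACE)"
SERVICE_BY_LINE = {
    line: service
    for service, lines in LINES_BY_SERVICE.items()
    for line in lines
}


def services_for(linename):
    """Return the distinct services for a linename such as 'LNQRW456',
    in order of first appearance. Lines without a known service are skipped."""
    seen = []
    for line in linename:
        service = SERVICE_BY_LINE.get(line)
        if service is not None and service not in seen:
            seen.append(service)
    return seen


single = services_for("ACE")
multi = services_for("LNQRW456")
```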


Correct subway line

According to the New York City Subway instructions, there are several kinds of transfer between lines. The first is the in-station transfer, where you can change from one line to another inside the station. For example, 14 St-Union Sq is a station on lines LNQRW456. We do not need to adjust these stations. The second kind is the free subway transfer and the free out-of-system transfer. Unlike the in-station transfer, passengers need to move from one station to another to transfer. The data for these transfers has some problems. For example, there is a free subway transfer between Court ST-23 ST (EM) and Court Sq (G7). However, the dataset lists the stations and lines as Court ST-23 ST:EGM, Court Sq:EGM, and Court Sq:7. To deal with problems like this, we reassigned the lines of stations with a free subway transfer or free out-of-system transfer according to the New York City Subway instructions. As a result, we only consider in-station transfers.

Outliers of entries and exits

For each station and time, we obtain the actual entries and exits by taking the difference in cumulative entries and exits between the current reading and the previous one. However, the results contain outliers: some entries and exits are negative or extremely large. We replaced these outliers with the mean of the last two observations at the same time and station. We did this in FutherCleanPassenger.py.
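
The differencing and outlier replacement can be sketched as below for a single series of readings (one station and one time slot). This is an illustration under assumptions, not the code in FutherCleanPassenger.py: the `max_count` threshold for "extremely large" and the fallback of 0 when no earlier count exists are both made up for the example.

```python
def counts_from_cumulative(cumulative, max_count=100_000):
    """Turn cumulative counter readings into per-interval counts.

    A difference that is negative or above max_count (an assumed threshold)
    is treated as an outlier and replaced by the mean of the last two counts
    already in the result, or by 0 when there is no earlier count.
    """
    counts = []
    for prev, curr in zip(cumulative, cumulative[1:]):
        diff = curr - prev
        if diff < 0 or diff > max_count:
            recent = counts[-2:]
            diff = sum(recent) / len(recent) if recent else 0
        counts.append(diff)
    return counts


# Hypothetical readings: a counter reset (1300 -> 50) and a jump to 9,000,000
readings = [1000, 1100, 1300, 50, 260, 9_000_000]
counts = counts_from_cumulative(readings)
```

The counter reset produces a negative difference and the final jump produces an implausibly large one; both are replaced by the mean of the two preceding counts.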

Exploratory Data Analysis

We conduct EDA to find trends in the data and provide insights for the model.

Subway Crime

New York City can be a dangerous place, and crime above ground often extends into the NYC subway.
We mainly focus on recent subway crime data in NYC. In total, there are 124439 complaints from 2006 to now.

Crime by Location

Heat Map of Subway Crime in NYC, 2006-2021

This map shows where crimes happen most frequently.

Map of Subway Crime in NYC, 2006-2021

This map shows each crime's location, type, and time, together with information about the victim and the suspect.

Distribution of crime in 7 Clusters

The map above gives only a rough impression of how many crimes occur in each part of NYC, so we show the count for each cluster in a bar chart.

Top 10 offense classification

There are 50 kinds of crime occurring in the subway; the bar chart shows the 10 most common.

The most frequent crimes are mainly grand larceny and assault.

Top 20 station where crime happens frequently

This chart shows which stations are the most dangerous.

Barchart by each Borough about Victims

This graph shows which races are more likely to be victimized in the subway in each borough.

As you can see, the proportion of African American victims is stable across clusters. In Manhattan, which has the most crimes, white people (including white Hispanic) are more vulnerable than African Americans.

Female Age Distribution for Sex Crimes

Let us look more closely at the victims' age groups for some specific crimes: SEX CRIMES and HARRASSMENT 2.
Most of the victims are in the 25-44 age group.

Crime Rate Top 20

Sometimes we care more about the crime rate in the subway than about the number of crimes, because we also care about the probability that the person standing in front of us is a suspect.

| linename | station | flow | crime | crime_rate |
| --- | --- | --- | --- | --- |
| AS | BROAD CHANNEL | 230815 | 271 | 0.0011741 |
| FM | 6 AV | 1049168 | 1096 | 0.0010446 |
| 23 | 125 ST | 10012389 | 2900 | 0.0002896 |
| 3 | NOSTRAND AV | 4981435 | 1436 | 0.0002883 |
| 1 | FRANKLIN ST | 4151331 | 1035 | 0.0002493 |
| 1 | 23 ST | 13849705 | 2870 | 0.0002072 |
| 7 | 74 ST-BROADWAY | 5244829 | 1068 | 0.0002036 |
| 1 | CANAL ST | 6143790 | 1061 | 0.0001727 |
| 1 | 125 ST | 8760782 | 1437 | 0.0001640 |
| 25 | INTERVALE AV | 4020500 | 621 | 0.0001545 |
| ACJLZ | BROADWAY JCT | 10856301 | 1584 | 0.0001459 |
| 6 | 3 AV 138 ST | 5030499 | 730 | 0.0001451 |
| J | 111 ST | 1578616 | 208 | 0.0001318 |
| BD | 182-183 STS | 2054457 | 241 | 0.0001173 |
| 4 | KINGSBRIDGE RD | 8467318 | 922 | 0.0001089 |
| 25 | JACKSON AV | 3582715 | 381 | 0.0001063 |
| L | ATLANTIC AV | 1447978 | 147 | 0.0001015 |
| 23 | CENTRAL PK N110 | 9509973 | 951 | 0.0001000 |
| F | AVENUE P | 1970015 | 194 | 0.0000985 |
| C | 155 ST | 2984168 | 282 | 0.0000945 |
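
The crime_rate column above is consistent with dividing the crime count by the passenger flow. The sketch below recomputes it for three rows of the table and ranks them; the function name and the dictionary layout are assumptions for illustration.

```python
def crime_rate(crime, flow):
    """Crimes per passenger: crime count divided by passenger flow."""
    return crime / flow


# Three rows copied from the table above
rows = [
    {"linename": "C", "station": "155 ST", "flow": 2984168, "crime": 282},
    {"linename": "AS", "station": "BROAD CHANNEL", "flow": 230815, "crime": 271},
    {"linename": "23", "station": "125 ST", "flow": 10012389, "crime": 2900},
]

for row in rows:
    row["crime_rate"] = crime_rate(row["crime"], row["flow"])

# Rank stations from the highest crime rate to the lowest
ranked = sorted(rows, key=lambda r: r["crime_rate"], reverse=True)
ranked_names = [r["station"] for r in ranked]
```

Note that 125 ST has by far the most crimes of the three but ranks below BROAD CHANNEL, whose passenger flow is much smaller.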

Crime by Time

Subway Passenger

Subway passenger EDA with location

Passenger flow is closely related to crime: the more passengers pass through a station, the more crimes we expect there. Therefore, we conduct EDA to:

  • Find relationship between location and passenger flow
  • Determine the most appropriate location variable for the model

Total passengers in each station

The color of each circle indicates the subway service, and its size reflects the total number of passengers in 2021.

# packages used by the code in this section
library(tidyverse)
library(leaflet)

# total entries, exits and passenger flow for each station
df = 
  passenger_df %>% 
    drop_na(entry_diff_imputed, exit_diff_imputed) %>% 
    group_by(station, service, linename, sublocality, postal_code, lat, long) %>% 
    summarise(total_entry = sum(entry_diff_imputed),
              total_exit = sum(exit_diff_imputed)) %>% 
    mutate(passenger_flow = total_entry + total_exit,
           # set passenger_flow to int
           passenger_flow = as.integer(passenger_flow))

# df %>% 
#   leaflet() %>% 
#   addTiles() %>% 
#   addCircleMarkers(~long, ~lat,radius= df$passenger_flow/100000000, weight= 0.9)


qpal <- colorQuantile("YlOrRd", df$passenger_flow, n = 4)

pal <- 
   colorFactor(palette = c("blue", "azure4", "orange",'green','green','brown','yellow','red','forestgreen','purple'), 
               levels = c('8 Avenue(ACE)',
                          'Shuttle(S)',
                          '6 Avenue(BDFM)',
                          'Brooklyn-Queens Crosstown(G)',
                          'Brooklyn-Queens(G)',
                          '14 St-Canarsie(L)',
                          'Broadway(NQRW)',
                          '7 Avenue(123)',
                          'Lexington Av(456)',
                          'Flushing(7)'))


df %>% 
  mutate(service = ifelse(service == 'Brooklyn-Queens Crosstown(G)', 'Brooklyn-Queens(G)', service)) %>% 
  mutate(passenger_flow2 = 10*log(passenger_flow)) %>% 
  leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircles(lng = ~long, lat = ~lat, weight = 1, stroke = FALSE,
    radius = ~sqrt(passenger_flow)/20, popup = ~station, color = ~pal(service), opacity = 0.75, fillOpacity = 0.75) %>%
  addLegend("topright", pal = pal, values = ~service, 
            title = "Subway Service", opacity = 0.75) %>% 
  setView(-73.8399986, 40.746739, zoom = 11)

Total passenger flow follows a clear spatial pattern. Big stations are mostly located in lower and middle Manhattan, and there are some sub-center stations in other areas, such as the 9th Street station in Brooklyn and the FLUSHING-MAIN station in Queens.

By sublocality

df %>% 
  ungroup %>% 
  # compare the sublocality column (not the string literal 'sublocality') with 'None'
  filter(sublocality != 'None') %>% 
  drop_na() %>% 
  group_by(sublocality) %>%  
  summarise(passenger_flow = sum(passenger_flow)) %>% 
  mutate(sublocality = as.factor(sublocality)) %>% 
  arrange(-passenger_flow) %>% 
  filter(passenger_flow < 500000 |passenger_flow > 60000000) %>% 
  knitr::kable()

| sublocality | passenger_flow |
| --- | --- |
| Manhattan | 890710230 |
| Brooklyn | 288469501 |
| Queens | 174401767 |
| Bronx | 98407980 |
| Staten Island | 297340 |

Manhattan had the most subway passengers in 2021 and Staten Island the fewest. Additionally, sublocality has only 5 levels, which is too few for a machine learning model.

EDA with zipcode

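Total passengers in each zipcode

# zctas() and geo_join() come from the tigris package
library(tigris)

# cache zip boundaries that are downloaded via the tigris package
options(tigris_use_cache = TRUE)


# get ZCTA (zip code) boundaries; they are narrowed to New York zip codes before plotting
char_zips = zctas(cb = TRUE)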
char_zips = 
  char_zips %>% 
  rename(postal_code = GEOID10)

summary_df<-
  df %>%
  mutate(postal_code) %>% 
  group_by(postal_code) %>%
  summarise(passenger_flow = sum(passenger_flow),
            station_cnt = n_distinct(station, linename)) 


summary_df<-geo_join(char_zips, 
                      summary_df, 
                      by_sp = "postal_code", 
                      by_df = "postal_code",
                      how = "left") %>% 
  filter(passenger_flow>=0)

pal <- colorNumeric(
  palette = "Greens",
  domain = summary_df$passenger_flow,
  na.color = "white")

labels <- 
  paste0(
    "Zip Code: ",
    summary_df$postal_code, "<br/>",
    "Flow of Passengers: ",
    summary_df$passenger_flow) %>%
  lapply(htmltools::HTML)

# summary_df2 = 
#   char_zips %>% 
#     select(postal_code) %>% 
#     left_join(summary_df, by = 'postal_code') 

summary_df %>%  
  mutate(postal_code_int = as.integer(postal_code)) %>% 
  filter(postal_code_int >= 10000 & postal_code_int < 14900) %>% 
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>% 
   addPolygons(fillColor = ~pal(passenger_flow),
              weight = 2,
              opacity = 1,
              color = "white",
              dashArray = "3",
              fillOpacity = 0.7,
              highlight = highlightOptions(weight = 2,
                                           color = "#666",
                                           dashArray = "",
                                           fillOpacity = 0.7,
                                           bringToFront = TRUE),
              label = labels) %>% 
  addLegend(pal = pal, 
            values = ~passenger_flow, 
            opacity = 0.7, 
            title = htmltools::HTML("Total Passengers 2021"),
            position = "bottomright") %>% 
  setView(-73.8399986, 40.746739, zoom = 10)

Total subway stations in each zipcode

labels <- 
  paste0(
    "Zip Code: ",
    summary_df$postal_code, "<br/>",
    "Stations count: ",
    summary_df$station_cnt) %>%
  lapply(htmltools::HTML)


pal <- colorNumeric(
  palette = "Purples",
  domain = summary_df$station_cnt,
  na.color = "white")

summary_df %>%  
  mutate(postal_code_int = as.integer(postal_code)) %>% 
  filter(postal_code_int >= 10000 & postal_code_int < 14900) %>% 
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>% 
   addPolygons(fillColor = ~pal(station_cnt),
              weight = 2,
              opacity = 1,
              color = "white",
              dashArray = "3",
              fillOpacity = 0.7,
              highlight = highlightOptions(weight = 2,
                                           color = "#666",
                                           dashArray = "",
                                           fillOpacity = 0.7,
                                           bringToFront = TRUE),
              label = labels) %>% 
  addLegend(pal = pal, 
            values = ~station_cnt, 
            opacity = 0.7, 
            title = htmltools::HTML("Total Stations 2021"),
            position = "bottomright") %>% 
  setView(-73.8399986, 40.746739, zoom = 10)

Zip codes do not capture the relationship between location and passenger flow well. For instance, some zip codes in lower Manhattan, such as 10002 and 10011, should have more passengers, but few stations are built there. The key cause of this mismatch is that subway stations are not sited by zip code.

Kmeans analysis of station

We set the number of clusters to 8 and use K-means to cluster the stations' latitudes and longitudes. The color of each circle indicates the K-means cluster it belongs to, and its size reflects the total number of passengers in 2021.

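# conduct kmeans on the station coordinates
# keep only stations with coordinates so the cluster labels line up with the rows of df
df = 
  df %>% 
  ungroup() %>% 
  drop_na(long, lat)

df_sub = 
  df %>% 
  select(long, lat)

# kmeans uses random starts: the cluster numbers (and hence the names
# assigned below) can change between runs unless a seed is fixed
k2 = kmeans(df_sub, centers = 8, nstart = 25)

# EDA with Kmeans results
df$cluster = k2$cluster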

df = 
  df %>%
  mutate(cluster = case_when(
    cluster == 1 ~ 'Queen',
    cluster == 2 ~ 'Upper Manhattan',
    cluster == 3 ~ 'Queen-Brooklyn',
    cluster == 4 ~ 'Middle Manhattan',
    cluster == 5 ~ 'Bronx',
    cluster == 6 ~ 'Brooklyn',
    cluster == 7 ~ 'Lower Manhattan',
    cluster == 8 ~ 'Rockaway Beach',
  ))

# Set1 has at most 9 colors, so request one color per cluster (8)
pal = colorFactor(
  RColorBrewer::brewer.pal(n = 8, name = "Set1"),
  df$cluster,
  levels = NULL,
  ordered = FALSE,
  na.color = "#808080",
  alpha = FALSE,
  reverse = FALSE
)

df %>% 
  leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircles(lng = ~long, lat = ~lat, weight = 1, stroke = FALSE,
    radius = ~sqrt(passenger_flow)/20, popup = ~station, color = ~pal(cluster), opacity = 1, fillOpacity = 1) %>%
  addLegend("topright", pal = pal, values = ~cluster, 
            title = "Kmeans Cluster", opacity = 1) %>% 
  setView(-73.8399986, 40.746739, zoom = 11)

The K-means algorithm divides Manhattan into three parts: lower, middle, and upper Manhattan. Brooklyn and Queens share 3 clusters, and there is also a cluster for the Bronx. We think the K-means result is easier to interpret than zip code or sublocality, and it partly represents the relationship between passenger flow and location. Therefore, we use the K-means result as the location variable in our model.

Model

Introduction

Methodology

Result

Model Application

We built the [No Crime Navigation](https://zheyanliu.shinyapps.io/NYC_subway_findroute/) app based on the Google Maps API and a GNN model.

Input parameters

In the left panel, users can select their personal information and type in their current location and destination. This includes:

  • Who are you: gender, age and race
  • When you leave: date and time
  • Where to go: your location and destination

Routes

When users enter their information and click the submit button, several candidate routes are displayed in the table on the right. The table shows the following information for each route:

  • time: travel time from the start location to the destination
  • walking distance: walking distance on this route
  • crime score: the likelihood of becoming the victim of a crime
  • crowdedness score: how crowded the route is
  • line[stops]: a brief summary of the route, giving the number of stops taken on each line

Interactive route map

Users can click on a row in the routes table to show the details of that route on the map. Multiple rows can be selected at once.

Summary

Results

Limitations

Data limitations

The crime data is merged from two datasets, one covering 2006-2020 and the other 2021-2022. Although they have similar columns and both contain all the variables we are interested in, the definitions of crimes and the data collection still differ between them.

The original passenger data only contains cumulative entries and exits. When we take differences between consecutive readings, the results contain negative and unreasonably large values. We imputed these erroneous values.

Another problem is that the station names in the crime data and the passenger data do not match. We matched them using an external data source (the Google Maps API): we use the API to get the exact coordinates of each station and assign each crime record to a station. Mismatches can happen in this process.

Model limitations

Acknowledgement

We would like to thank Zhuohui Liang, who gave us suggestions on this project. In addition, we want to thank last year's team ‘Police Violence and Protest’. Their interactive map gave us the idea of building an interactive crime map. Moreover, we would like to thank Rebekah Hughes from that team for answering our question about the Shiny Dashboard navbar. Finally, we want to thank the [Google Maps API team](https://github.com/googlemaps/google-maps-services-python) for their open-source code and free-to-use API; the [No Crime Navigation App](https://zheyanliu.shinyapps.io/NYC_subway_findroute/) would not be possible without their contributions.

Reference